Statistical Models for Linguistic Variation in Social Media
Authors
Abstract
Language on the Internet and social media varies with time, geography, and social factors. For example, consider an online chat forum where people from different regions across the world interact. In such scenarios, it is important to track and detect regional variation in language. A person from the UK who is in conversation with someone from the USA could say “he is stuck in the lift” to mean “he is stuck in an elevator”, since the word lift means an elevator in the UK. Note that in the US, lift does not refer to an elevator. Modeling such variation allows applications to prompt or suggest the intended meaning to the other participants in the conversation. In this thesis I conduct two related lines of inquiry focusing on (a) language itself and the variation it manifests and (b) the user and what we can infer about them based on their language use on social media. First, I develop computational methods to track and detect changes in word usage, including semantic and syntactic variation. I examine two modalities: time and geography. Specifically, I outline methods that use distributional word representations (word embeddings) to detect semantic variation in word usage. These methods scale to large datasets, making them particularly suited to social media. Second, I turn my attention to users. In particular, I seek to model latent personality traits of users based on their language use on social media. I propose to develop generative latent factor models that explicitly build a representation of each user based on their inferred latent personality traits. These models seek to capture latent personality traits that serve as useful covariates for a wide variety of tasks, such as predicting which topics users like on social media and the number of friends in their social circle.
This work has broad applications in several fields, such as information retrieval, semantic web applications, socio-variational linguistics, computational social science including digital health care, psycho-linguistics, and ad targeting.
April 18, 2016 DRAFT
Similar resources
The Role of Sociolinguistics in Second Language Acquisition
Learning a new language also involves learning a broad system of norms for social relations. This study broadly showed how EFL learners’ speech act is conveyed from their native cultures when they are communicating in English and demonstrated that there are some possibilities of cross-cultural misunderstanding when interlocutors are engaged in the speech act of complimenting with native speakers of...
A Latent Variable Model for Geographic Lexical Variation
The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional ...
Identifying Dogmatism in Social Media: Signals and Models
We explore linguistic and behavioral features of dogmatism in social media and construct statistical models that can identify dogmatic comments. Our model is based on a corpus of Reddit posts, collected across a diverse set of conversational topics and annotated via paid crowdsourcing. We operationalize key aspects of dogmatism described by existing psychology theories (such as over-confidence)...
Gender identity and lexical variation in social media
We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various styles and topical interests. Many cl...
The Third International Workshop on Natural Language Processing for Social Media
Social media is sometimes described as a new domain, genre, or task for natural language processing. This suggests that it has specific properties that distinguish it from other sources of text. I will argue that there are exactly two such properties: variation and change. NLP research has historically focused on genres such as newstext, where there is strong pressure towards standardization. F...